A common approach to determining the cost of products is the should-cost method: estimating what a product should cost based on materials, labor, overhead, and profit margin. Although this strategy is very accurate, it has the drawback of being tedious and requires expert knowledge of industrial technologies and processes. To get a quick estimate, it is possible to build a statistical model that predicts the price of products given their characteristics. With such a model, it would no longer be necessary to be an expert or to wait several days to assess the impact of a design modification, a change of supplier, or a change of production site. Before building a model, it is important to explore the data, which is the aim of this case study. This study was commissioned by a cosmetics company that wants to estimate the price of screw caps for shampoo bottles.
Let’s first load the different packages, then load the database and study its structure.
#Loading the different packages for this study
library(dplyr)
package ‘dplyr’ was built under R version 3.4.2
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(FactoMineR)
package ‘FactoMineR’ was built under R version 3.4.2
library(cluster)
library(fpc)
library(factoextra)
Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(FactoInvestigate)
package ‘FactoInvestigate’ was built under R version 3.4.2
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
QUESTION 1 : Screw Caps Dataset
Now, we load the dataset used for this study:
#Loading the dataset
dataset <- read.table("ScrewCaps.csv", header = TRUE, sep = ",", dec = ".", row.names = 1)
#Printing the dataset
head(dataset)
#Understanding the structure
print(paste0("DB Dimensions: ", dim(dataset)[1]," X " , dim(dataset)[2] ))
[1] "DB Dimensions: 195 X 11"
summary(dataset)
Supplier Diameter weight nb.of.pieces Shape Impermeability
Supplier A: 31 Min. :0.4458 Min. :0.610 Min. : 2.000 Shape 1:134 Type 1:172
Supplier B:150 1st Qu.:0.7785 1st Qu.:1.083 1st Qu.: 3.000 Shape 2: 45 Type 2: 23
Supplier C: 14 Median :1.0120 Median :1.400 Median : 4.000 Shape 3: 8
Mean :1.2843 Mean :1.701 Mean : 4.113 Shape 4: 8
3rd Qu.:1.2886 3rd Qu.:1.704 3rd Qu.: 5.000
Max. :5.3950 Max. :7.112 Max. :10.000
Finishing Mature.Volume Raw.Material Price Length
Hot Printing: 62 Min. : 1000 ABS: 21 Min. : 6.477 Min. : 3.369
Lacquering :133 1st Qu.: 15000 PP :148 1st Qu.:11.807 1st Qu.: 6.161
Median : 45000 PS : 26 Median :14.384 Median : 8.086
Mean : 96930 Mean :16.444 Mean :10.247
3rd Qu.:115000 3rd Qu.:18.902 3rd Qu.:10.340
Max. :800000 Max. :46.610 Max. :43.359
The data ScrewCaps.csv contains 195 lots of screw caps described by 11 variables. Diameter, weight and Length are the physical characteristics of the cap; nb.of.pieces corresponds to the number of elements of the cap (a cap with 2 pieces, for instance, has a valve made of a different material); Mature.Volume corresponds to the number of caps ordered and bought by the company (number in the lot). All the categorical features are factors; the other features are numerical.
QUESTION 2 : Univariate and bivariate descriptive statistics
Price distribution
d <- density(dataset$Price)
#Plotting the histogram
hist(dataset$Price, breaks=40, probability = TRUE, main = "Price distribution",
xlab = "Price")
#Plotting the density
lines(d, col = "red")
We have here a bimodal distribution, which we can describe in more detail with the quantiles:
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "All Points") %>% layout( title = 'Price Boxplot', yaxis = list(title = 'Price'))
p
Using this plotly boxplot we notice that 25% of the prices lie between 6.48 and 11.81, 50% between 6.48 and 14.38, and 75% between 6.48 and 18.90. The remaining 25% are spread over a wide range of prices, between 18.90 and 46.61.
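The quartile boundaries quoted above come straight from the empirical quantiles; a minimal base-R sketch on synthetic prices (illustrative values, not the ScrewCaps data):

```r
# Illustrative only: synthetic log-normal prices, not the ScrewCaps data
set.seed(1)
prices <- rlnorm(200, meanlog = 2.6, sdlog = 0.4)

# The five-number summary behind a boxplot
q <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))

# By definition, about 25% of observations fall at or below the first quartile
mean(prices <= q["25%"])  # ~0.25
```

The same `quantile()` call, applied to `dataset$Price`, reproduces the boundaries read off the boxplot.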
Price dependency on length
Let’s now study how the price depends on length.
p <- ggplot(data=dataset, aes(x= Length, y= Price)) + geom_point(size=1) + geom_smooth(method=lm) + ggtitle(" Price versus length ")
ggplotly(p)
fit_price_length <- lm(Price~Length, data=dataset)
summary(fit_price_length)
Call:
lm(formula = Price ~ Length, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-13.901 -2.854 -0.741 1.931 16.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.94613 0.50918 17.57 <2e-16 ***
Length 0.73168 0.03953 18.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.308 on 193 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6378
F-statistic: 342.6 on 1 and 193 DF, p-value: < 2.2e-16
We can observe a clear dependence: 63.9% of the variability of the price is explained by the length.
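For a simple linear regression, the R-squared reported by `summary()` is just the squared correlation between the two variables; a quick base-R check on synthetic data (stand-ins for the real Length/Price columns):

```r
# Synthetic stand-ins, not the real Length and Price columns
set.seed(2)
x <- runif(100, 3, 43)                  # stand-in for Length
y <- 9 + 0.7 * x + rnorm(100, sd = 4)   # stand-in for Price

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

# In simple regression, R^2 equals cor(x, y)^2
all.equal(r2, cor(x, y)^2)  # TRUE
```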
Price dependency on weight
Now we study the price dependency on weight.
p <- ggplot(data=dataset, aes(x= weight, y= Price)) + geom_point(size=1) + geom_smooth(method=lm) + ggtitle(" Price versus weight ")
ggplotly(p)
fit_price_weight <- lm(Price~weight, data=dataset)
summary(fit_price_weight)
Call:
lm(formula = Price ~ weight, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-14.7993 -2.6207 -0.6631 2.5396 13.8357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2275 0.5602 14.69 <2e-16 ***
weight 4.8312 0.2718 17.78 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.419 on 193 degrees of freedom
Multiple R-squared: 0.6208, Adjusted R-squared: 0.6189
F-statistic: 316 on 1 and 193 DF, p-value: < 2.2e-16
We can also observe a dependence: 62% of the variability of the price is explained by the weight.
Price dependency on Impermeability, Shape and Supplier
Now we will discuss the price dependency on some categorical features such as Impermeability, Shape and Supplier.
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Impermeability, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Impermeability Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axis
fit_price_impermeability <- lm(Price~ Impermeability, data=dataset)
summary(fit_price_impermeability)
Call:
lm(formula = Price ~ Impermeability, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-16.4106 -3.0187 -0.6286 2.4897 25.0638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7236 0.4117 35.77 <2e-16 ***
ImpermeabilityType 2 14.5846 1.1986 12.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.399 on 193 degrees of freedom
Multiple R-squared: 0.4341, Adjusted R-squared: 0.4312
F-statistic: 148 on 1 and 193 DF, p-value: < 2.2e-16
The boxplot shows us that each impermeability type gathers a wide range of prices. Nevertheless, we notice a dependence: 43% of the variability of the price is explained by the impermeability type. Moreover, the price ranges of Type 1 and Type 2 are statistically different: Type 1 is significantly cheaper than Type 2.
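The significance of the Type 1 vs Type 2 price gap can also be checked with a two-sample test; a hedged sketch on synthetic groups (the group sizes and means mimic, but are not, the real data):

```r
# Synthetic prices mimicking the two impermeability types (illustrative only)
set.seed(3)
type1 <- rnorm(172, mean = 14.7, sd = 5)  # stand-in for Type 1 prices
type2 <- rnorm(23,  mean = 29.3, sd = 7)  # stand-in for Type 2 prices

# Welch two-sample t-test: does the mean price differ between types?
tt <- t.test(type2, type1)
tt$p.value < 0.05  # TRUE: the difference is significant
```

On the real data, the equivalent call would be `t.test(Price ~ Impermeability, data = dataset)`.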
Concerning the price dependency on Shape :
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Shape, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Shape Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axis
fit_price_shape <- lm(Price~ Shape, data=dataset)
summary(fit_price_shape)
Call:
lm(formula = Price ~ Shape, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-11.098 -3.850 -1.025 3.055 25.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.2006 0.5406 26.267 < 2e-16 ***
ShapeShape 2 8.1403 1.0782 7.550 1.75e-12 ***
ShapeShape 3 1.4510 2.2777 0.637 0.52485
ShapeShape 4 7.4393 2.2777 3.266 0.00129 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.258 on 191 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2357
F-statistic: 20.94 on 3 and 191 DF, p-value: 9.008e-12
In this case, it is harder to see a dependence between price and shape: only 24% of the variability of the price is explained by the shape. The coefficients still give some insights: Shape 2 and Shape 4 are significantly more expensive than the baseline Shape 1, while Shape 3 is not.
Concerning the price dependency on Suppliers :
p <- plot_ly(type = 'box') %>% add_boxplot(y = dataset$Price , x = dataset$Supplier, jitter = 0.3, pointpos = -1.8, boxpoints = 'all',
marker = list(color = 'rgb(7,40,89)'),
line = list(color = 'rgb(7,40,89)'),
name = "Price box") %>% layout( title = 'Price versus Supplier Boxplot', yaxis = list(title = 'Price'))
p
Can't display both discrete & non-discrete data on same axis
fit_price_supplier <- lm(Price~ Supplier, data=dataset)
summary(fit_price_supplier)
Call:
lm(formula = Price ~ Supplier, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-11.431 -4.491 -1.847 2.873 30.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.029 1.285 14.033 <2e-16 ***
SupplierSupplier B -1.768 1.411 -1.252 0.212
SupplierSupplier C -3.140 2.303 -1.363 0.174
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.153 on 192 degrees of freedom
Multiple R-squared: 0.01174, Adjusted R-squared: 0.001449
F-statistic: 1.141 on 2 and 192 DF, p-value: 0.3217
There is no significant dependence of the price on the supplier (p-value = 0.32). Let’s study the prices in more detail:
PriceComp_avg <- dataset %>% select(Supplier,Price) %>% group_by(Supplier) %>% summarise(Average_Price = mean(Price))
head(PriceComp_avg)
PriceComp_min <- dataset %>% select(Supplier,Price) %>% group_by(Supplier) %>% summarise(Minimum_Price = min(Price))
head(PriceComp_min)
PriceComp_avg_price_per_weight <- dataset %>% select(Supplier,Price,weight) %>% group_by(Supplier) %>% summarise(Price_per_weight = mean(Price)/mean(weight))
head(PriceComp_avg_price_per_weight)
In terms of average price, Supplier C is the least expensive. In terms of minimum price, Supplier B is the least expensive; however, Supplier B also has the highest maximum price. In terms of average price per unit of weight, Supplier A is the least expensive.
QUESTION 3 : Identifying outliers
One important point in exploratory data analysis is identifying potential outliers. Let’s identify these outliers for the different features, starting with the Mature.Volume variable:
d <- density(dataset$Mature.Volume)
hist(dataset$Mature.Volume, breaks=40, probability = TRUE, main = "Mature Volume distribution",
xlab = "Mature Volume")
lines(d, col = "red")
We can clearly notice an outlier here, so we remove it:
dataset <- dataset %>% filter ( Mature.Volume < 600000 )
Let’s verify the data now :
d <- density(dataset$Mature.Volume)
hist(dataset$Mature.Volume, breaks=40, probability = TRUE,main = "Mature Volume distribution",
xlab = "Mature Volume")
lines(d, col = "red")
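Beyond visual inspection of the histogram, a common programmatic screen is the 1.5 × IQR rule; a minimal base-R sketch on synthetic volumes (the 800000 value plays the role of the outlying lot removed above):

```r
# Synthetic skewed volumes plus one extreme value (illustrative only)
set.seed(4)
volumes <- c(rlnorm(194, meanlog = 10.8, sdlog = 1), 800000)

q1  <- quantile(volumes, 0.25)
q3  <- quantile(volumes, 0.75)
iqr <- q3 - q1

# Flag points lying more than 1.5 * IQR beyond the quartiles
outliers <- volumes[volumes < q1 - 1.5 * iqr | volumes > q3 + 1.5 * iqr]
```

Note that on heavily right-skewed data this rule can flag several legitimate large values, so it should complement, not replace, the visual check.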
After studying the distributions of the other features, we notice no other outliers. Moreover, every numerical feature seems to have a similarly right-skewed shape. The other distributions are shown below:
d <- density(dataset$Diameter)
hist(dataset$Diameter, breaks=40, probability = TRUE, main = "Diameter distribution",
xlab = "Diameter")
lines(d, col = "red")
d <- density(dataset$weight)
hist(dataset$weight, breaks=40, probability = TRUE, main = "Weight distribution",
xlab = "Weight")
lines(d, col = "red")
d <- density(dataset$nb.of.pieces)
hist(dataset$nb.of.pieces, breaks=40, probability = TRUE, main = "Nb of pieces distribution",
xlab = "Nb of pieces")
lines(d, col = "red")
d <- density(dataset$Length)
hist(dataset$Length, breaks=40, probability = TRUE, main = "Length distribution",
xlab = "Length")
lines(d, col = "red")
QUESTION 4/5/6/7 : PCA and correlation matrix
Now we will perform a PCA on the data. PCA allows us to find a low-dimensional representation of the data that captures the “essence” of the raw data, and it also helps denoise it. This preprocessing and exploration step helps us better understand and visualise the relations between the different features and observations, and prepares the data for the prediction step. PCA itself handles only continuous variables; categorical variables can nevertheless be displayed by projecting each category at the barycentre of the observations that take that category.
Since we want to predict the price, and Supplier, Shape, Impermeability, Finishing and Raw.Material are qualitative variables, we will treat them as supplementary (illustrative). Let’s now run the PCA:
res.pca <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE)
summary(res.pca, nbelements = 10)
Call:
PCA(X = dataset, scale.unit = TRUE, quanti.sup = 10, quali.sup = c(1,
5, 6, 7, 9))
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Variance 3.107 1.067 0.777 0.049 0.000
% of var. 62.142 21.338 15.537 0.976 0.006
Cumulative % of var. 62.142 83.481 99.018 99.994 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 4.259 | 4.026 2.731 0.894 | -1.247 0.763 0.086 | -0.008 0.000 0.000 |
2 | 4.740 | 4.674 3.681 0.972 | -0.633 0.196 0.018 | 0.466 0.146 0.010 |
3 | 4.739 | 4.662 3.662 0.968 | -0.671 0.221 0.020 | 0.518 0.181 0.012 |
4 | 0.966 | 0.050 0.000 0.003 | 0.451 0.100 0.218 | -0.796 0.427 0.680 |
5 | 1.644 | -0.771 0.100 0.220 | -0.314 0.048 0.036 | 1.408 1.337 0.734 |
6 | 0.802 | -0.507 0.043 0.399 | 0.577 0.163 0.518 | 0.178 0.021 0.049 |
7 | 1.123 | 0.253 0.011 0.051 | -0.185 0.017 0.027 | -1.070 0.772 0.908 |
8 | 1.145 | 0.605 0.062 0.279 | 0.959 0.452 0.702 | -0.154 0.016 0.018 |
9 | 1.153 | 0.622 0.065 0.291 | 0.959 0.451 0.692 | -0.149 0.015 0.017 |
10 | 1.165 | 0.647 0.071 0.309 | 0.958 0.450 0.676 | -0.142 0.014 0.015 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Diameter | 0.985 31.233 0.970 | -0.025 0.061 0.001 | 0.144 2.674 0.021 |
weight | 0.977 30.750 0.955 | -0.029 0.077 0.001 | 0.103 1.371 0.011 |
nb.of.pieces | -0.202 1.309 0.041 | 0.843 66.557 0.710 | 0.499 32.076 0.249 |
Mature.Volume | -0.412 5.458 0.170 | -0.596 33.258 0.355 | 0.690 61.213 0.476 |
Length | 0.985 31.250 0.971 | -0.023 0.048 0.001 | 0.144 2.666 0.021 |
Supplementary continuous variable
Dim.1 cos2 Dim.2 cos2 Dim.3 cos2
Price | 0.796 0.634 | 0.171 0.029 | 0.131 0.017 |
Supplementary categories (the 10 first)
Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2 v.test
Supplier A | 0.591 | 0.548 0.859 1.813 | -0.055 0.009 -0.308 | -0.214 0.131 -1.416 |
Supplier B | 0.144 | -0.065 0.206 -0.949 | -0.126 0.759 -3.109 | -0.027 0.035 -0.782 |
Supplier C | 1.675 | -0.444 0.070 -0.976 | 1.441 0.739 5.407 | 0.728 0.189 3.203 |
Shape 1 | 0.496 | -0.426 0.736 -4.859 | -0.138 0.077 -2.687 | -0.214 0.186 -4.888 |
Shape 2 | 1.530 | 1.427 0.870 6.196 | 0.394 0.066 2.921 | 0.383 0.063 3.325 |
Shape 3 | 0.668 | -0.560 0.703 -0.915 | -0.332 0.248 -0.927 | 0.060 0.008 0.195 |
Shape 4 | 1.427 | -0.552 0.150 -0.902 | 0.356 0.062 0.992 | 1.265 0.786 4.138 |
Type 1 | 0.450 | -0.450 1.000 -9.517 | -0.002 0.000 -0.066 | -0.009 0.000 -0.389 |
Type 2 | 3.290 | 3.289 1.000 9.517 | 0.013 0.000 0.066 | 0.067 0.000 0.389 |
Hot Printing | 0.354 | -0.286 0.653 -1.551 | -0.038 0.011 -0.349 | 0.192 0.295 2.083 |
fviz_pca_ind(res.pca)
fviz_pca_ind(res.pca, col.ind="cos2", label=c("quali"), geom = "point") + scale_color_gradient2(low="lightblue", mid="blue", high="darkblue", midpoint=0.6)+ theme_minimal()
Before commenting on these graphs, let’s also compute the correlation matrix:
#Scaling the variables:
X <- scale(as.matrix(dataset %>% select(-c(1,5,6,7,9,10))))
#Correlation matrix (covariance of the scaled variables):
as.data.frame(cov(X))
The variable factor map shows us that Diameter, weight and Length are highly correlated with each other (and with the first dimension), while nb.of.pieces and Mature.Volume are nearly orthogonal to them.
These correlations are confirmed by the matrix above: the cells corresponding to pairs of highly correlated features have values above 0.9, while the cells corresponding to nearly uncorrelated (orthogonal) features have values below 0.2-0.3.
The PCA focuses on the relationships between the continuous variables. In fact, PCA computes the vectors (synthetic variables) that are the most correlated with all the continuous variables. It is then possible to study the projections of the observations and features on these vectors and discuss the links between them. The limitation is that PCA cannot include categorical variables in the computation of the synthetic vectors.
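These synthetic vectors are the eigenvectors of the correlation matrix of the standardized continuous variables; a base-R sketch showing that `prcomp` on scaled data and `eigen` on the correlation matrix agree (synthetic data, not ScrewCaps):

```r
# Synthetic data: four variables, two of them strongly correlated
set.seed(5)
X <- matrix(rnorm(200 * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.1)

# PCA on standardized variables ...
pc <- prcomp(X, scale. = TRUE)

# ... is the eigendecomposition of the correlation matrix
ev <- eigen(cor(X))

# The component variances equal the eigenvalues of cor(X)
all.equal(ev$values, pc$sdev^2)  # TRUE
```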
Let’s now focus on the individual factor map. The barycentres related to \(Impermeability = Type2\) and \(Raw.Material = PS\) lie close to the first synthetic axis, which means that these two categories are strongly associated with this axis and therefore, given the previous analysis, with Length, weight, Price and Diameter. For instance, we have seen that Type 2 caps are on average more expensive than Type 1 caps; similarly, a PS cap tends to have a large diameter.
plot(res.pca$eig[,3], type="l", ylab = "Cumulative percentage of inertia", xlab = "Nb of synthetic vectors")
QUESTION 8 : Synthetic variables
The R object containing the two principal components, i.e. the synthetic variables most correlated with all the variables, is given by the coordinates on the two eigenvectors of the PCA associated with the two largest eigenvalues:
as.data.frame(res.pca$var$coord[,1:2])
QUESTION 9
PCA is often used as a pre-processing step before applying a clustering algorithm: we perform AHC or k-means on the first \(k\) principal components to denoise the data. In this setting \(k\) is chosen fairly large, since we do not want to lose information, but we still discard the last components, which can be considered as noise. Consequently, we keep the number of dimensions \(k\) such that we reach 95% of the inertia of the PCA. In our case \(k = 3\) (cf. the last graph).
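This choice can be automated: take the smallest number of components whose cumulative share of inertia reaches 95%. A sketch using the variance percentages printed in the PCA summary above (hard-coded here, since the `res.pca` object is not reproduced):

```r
# Percentages of variance per dimension, from the PCA output above
pct <- c(62.142, 21.338, 15.537, 0.976, 0.006)

# Smallest k whose cumulative inertia reaches 95%
k <- which(cumsum(pct) >= 95)[1]
k  # 3
```

With a live `res.pca` object, `pct` would be `res.pca$eig[, 2]`.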
QUESTION 10
Let’s now perform a kmeans algorithm on the selected k principal components of PCA.
# We keep the 3 first components of the PCA
dat <- res.pca$ind$coord[,1:3]
#Performing the clustering
clus <- kmeans(dat, 3, nstart = 20)
#Visualizing the clusters
plot(dat, col = clus$cluster, pch = 19, frame = FALSE, main = "K-means with k = 3")
points(clus$centers, col = 1:3, pch = 8, cex = 3)
# Visualizing the total within sum of squares and applying the elbow method
fviz_nbclust(dat, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
Using the elbow method we find that the optimal number of clusters is 3.
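The elbow criterion behind `fviz_nbclust` can be reproduced with base R alone: compute the total within-cluster sum of squares for each k and look for the bend. A sketch on synthetic three-group data (not the PCA coordinates):

```r
# Synthetic data with three well-separated groups (illustrative only)
set.seed(6)
dat <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(dat, k, nstart = 20)$tot.withinss)

# The curve drops sharply up to k = 3, then flattens: the elbow
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within SS")
```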
QUESTION 11
#Performing the PCA again, keeping only the first 3 principal components
res.pca3 <- PCA(dataset, quali.sup=c(1,5,6,7,9), quanti.sup = 10, scale = TRUE, ncp = 3)
# Performing the AHC on the 3 principal components of the PCA
res.hcpc3 <- HCPC(res.pca3, nb.clust = -1)
plot(res.hcpc3$call$t$within[1:14])
QUESTION 12 : Cluster description
The optimal number of clusters computed by the HCPC function is 3. This function applies the elbow method to the within-cluster inertias stored in `$call$t$within`.
We now aim to describe the clusters. First, let’s study the variables that are most important for the partition. The HCPC function computes an F-test that allows us to evaluate the link between a qualitative and a quantitative variable, here the clusters and our quantitative variables:
as.data.frame(res.hcpc3$desc.var$quanti.var)
We notice that all our quantitative variables describe our 3 clusters well. The three variables with the lowest p-values are Length, Diameter and weight. This is linked to our PCA: these three variables are very representative of the first dimension of the PCA and therefore describe the data well. Given that we performed the HCPC on the PCA coordinates, it is natural that these three variables describe our clusters well.
For the qualitative variables, we perform a chi-squared test:
res.hcpc3$desc.var$test.chi2
p.value df
Impermeability 5.318642e-18 2
Raw.Material 5.226547e-17 4
Shape 5.626207e-06 6
Supplier 4.102258e-02 4
The variable Impermeability seems to be the most related to the partitioning (lowest p-value).
For the quantitative variables :
res.hcpc3$desc.var$quanti$`1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 11.942982 2.431183e+05 82206.026178 67166.762125 9.103190e+04 7.064414e-33
Diameter -3.255425 8.214269e-01 1.294639 0.254233 9.821218e-01 1.132228e-03
Length -3.297003 6.491733e+00 10.329589 2.056760 7.864783e+00 9.772253e-04
weight -3.536244 1.100262e+00 1.714121 0.315574 1.172854e+00 4.058595e-04
nb.of.pieces -3.780986 3.324324e+00 4.115183 1.274576 1.413225e+00 1.562083e-04
Price -3.857939 1.245686e+01 16.552332 4.115901 7.172431e+00 1.143473e-04
Cluster 1 is well described by all the quantitative variables, and most specifically by Mature.Volume. This was foreseeable, since cluster 1 (the black points) lies along the same axis as the Mature.Volume vector in the previous PCA.
res.hcpc3$desc.var$category$`1`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 25.000000 97.297297 75.392670 0.0001368886 3.813724
Impermeability=Type 1 22.023810 100.000000 87.958115 0.0049829303 2.808135
Supplier=Supplier C 0.000000 0.000000 7.329843 0.0434829294 -2.019041
Raw.Material=PS 3.846154 2.702703 13.612565 0.0222292828 -2.286427
Raw.Material=ABS 0.000000 0.000000 10.994764 0.0081544536 -2.645607
Impermeability=Type 2 0.000000 0.000000 12.041885 0.0049829303 -2.808135
Shape=Shape 2 2.222222 2.702703 23.560209 0.0002330416 -3.680210
In the same vein, \(Raw.Material=PP\) and \(Shape=Shape 2\) characterise cluster 1 well, given their p-values.
It is also interesting to illustrate cluster 1 by computing its paragons:
res.hcpc3$desc.ind$para$`1`
94 95 142 74 144
0.3280604 0.3376965 0.4205536 0.4265863 0.5886435
The products 94, 95, 142, 74 and 144 are the closest to the centroid of cluster 1 and are therefore representative of this cluster.
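A paragon is simply the individual closest (in Euclidean distance) to its cluster’s centroid; a base-R sketch of the computation on synthetic component scores (not the actual `res.pca` coordinates):

```r
# Synthetic stand-in for PCA coordinates (illustrative only)
set.seed(7)
scores  <- matrix(rnorm(60), ncol = 3)
cluster <- kmeans(scores, 2, nstart = 20)$cluster

# Paragon of cluster 1: the observation nearest to the cluster-1 centroid
members  <- scores[cluster == 1, , drop = FALSE]
centroid <- colMeans(members)
d        <- sqrt(rowSums(sweep(members, 2, centroid)^2))
paragon  <- which(cluster == 1)[which.min(d)]
```

Ranking all of `d` instead of taking only the minimum reproduces the ordered list returned by `desc.ind$para`.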
QUESTION 13
We did not choose \(k = 2\) because in that configuration the first two eigenvectors describe less than 95% of the inertia. We did not choose \(k = 4\) because it is unnecessary: with \(k = 3\) we already describe 95% of the inertia.
A strategy to assess the stability of the approach is to repeat the same study with the HCPC on 2 and 4 components, then compare the p-values for each configuration to see whether going from 2 to 3, or from 3 to 4, components has a significant impact on the clustering results. After doing this, we noticed that the p-values are the same for \(k = 3\) and \(k = 4\), and that they are on average better for \(k = 3\) than for \(k = 2\). These insights show that 3 components is the best compromise. In the same vein, we noticed that there is no difference in terms of cluster description between a clustering obtained on k components and one obtained on the initial data.
QUESTION 14
The methodology that we have used to describe clusters can also be used to describe a categorical variable, for instance the supplier.
catdes(dataset, num.var= 1)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 9.049049e-05 4
Impermeability 1.088731e-02 2
$category
$category$`Supplier A`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PS 42.30769 37.93103 13.61257 0.0002998155 3.615459
Impermeability=Type 2 34.78261 27.58621 12.04188 0.0130149176 2.483361
Shape=Shape 2 26.66667 41.37931 23.56021 0.0213728107 2.301333
Raw.Material=ABS 0.00000 0.00000 10.99476 0.0254288561 -2.234825
Impermeability=Type 1 12.50000 72.41379 87.95812 0.0130149176 -2.483361
$category$`Supplier B`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=ABS 100.00000 14.18919 10.99476 0.003330616 2.935453
Raw.Material=PS 57.69231 10.13514 13.61257 0.015928453 -2.410551
Shape=Shape 2 60.00000 18.24324 23.56021 0.002374481 -3.038894
$category$`Supplier C`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 9.722222 100 75.39267 0.01626019 2.403023
$quanti.var
Eta2 P-value
nb.of.pieces 0.2137072 1.530822e-10
$quanti
$quanti$`Supplier A`
NULL
$quanti$`Supplier B`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -2.817845 3.959459 4.115183 1.240523 1.413225 0.004834708
$quanti$`Supplier C`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 6.345875 6.428571 4.115183 1.720228 1.413225 2.211654e-10
attr(,"class")
[1] "catdes" "list "
We describe here the Supplier categorical variable, but the method is the same for the other variables. Using the results of catdes, we can interpret the different suppliers as classes. We first notice that Raw.Material and Impermeability describe the “supplier clustering” well. Diving into the individual suppliers, we see for example that Supplier A is characterised by (in order of importance): an over-representation of Raw.Material=PS, Impermeability=Type 2 and Shape=Shape 2, and an under-representation of Raw.Material=ABS and Impermeability=Type 1. These variables best describe Supplier A. This output is very insightful: running catdes on Supplier, for instance, allows the user to benchmark the different suppliers.
QUESTION 15
To simultaneously take quantitative and categorical variables into account in the clustering, we apply the HCPC function to the results of an FAMD. FAMD stands for Factor Analysis of Mixed Data and is the counterpart of PCA dedicated to mixed data.
res.famd = FAMD(dataset, ncp = 10, sup.var = c(10) )
res.famd$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 4.6069336 32.906669 32.90667
comp 2 1.6700311 11.928794 44.83546
comp 3 1.4684324 10.488803 55.32426
comp 4 1.2771099 9.122213 64.44648
comp 5 1.0019966 7.157119 71.60360
comp 6 0.8935056 6.382183 77.98578
comp 7 0.7712154 5.508681 83.49446
comp 8 0.6765127 4.832233 88.32669
comp 9 0.6046173 4.318695 92.64539
comp 10 0.4570087 3.264348 95.90974
This FAMD analysis changes the results in several ways. First, the share of variance captured by the leading components decreases, since FAMD takes more variables into account: we now need 10 components (against 3 for the PCA) to reach 95% of the cumulative variance. Moreover, this analysis shows that the Price is uncorrelated with Supplier, Finishing, nb.of.pieces and Mature.Volume, whereas Impermeability and Raw.Material directly influence the price of the product; Shape is partially correlated with the price.
res.hcpc <- HCPC(res.famd, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
plot.HCPC(res.hcpc, choice = "map", draw.tree = FALSE, select = "drawn", title = '')
We see here that adding more variables to the analysis reduces the quality of the clustering in comparison with the previous one.
QUESTION 16
Let’s now build a model to predict the price:
library(gbm)
index_test <- sample(x=1:nrow(dataset), size=floor(0.3*nrow(dataset)))
testing_data_set <- dataset[index_test,]
training_data_set <- dataset[-index_test,]
fit <- lm(Price ~ Diameter + Length + weight + Raw.Material + Impermeability + Shape, data= training_data_set)
summary(fit)
Call:
lm(formula = Price ~ Diameter + Length + weight + Raw.Material +
Impermeability + Shape, data = training_data_set)
Residuals:
Min 1Q Median 3Q Max
-13.9119 -2.0775 -0.4518 1.9397 8.3672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.12367 1.25116 8.091 4.61e-13 ***
Diameter 8.71035 13.83976 0.629 0.53026
Length -0.59981 1.72778 -0.347 0.72906
weight 0.61077 1.05438 0.579 0.56346
Raw.MaterialPP -0.77658 1.07197 -0.724 0.47016
Raw.MaterialPS -2.60612 1.41244 -1.845 0.06741 .
ImpermeabilityType 2 7.41997 1.63463 4.539 1.32e-05 ***
ShapeShape 2 -0.02204 0.94265 -0.023 0.98139
ShapeShape 3 0.48874 2.20456 0.222 0.82492
ShapeShape 4 5.71577 1.72660 3.310 0.00122 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.69 on 124 degrees of freedom
Multiple R-squared: 0.776, Adjusted R-squared: 0.7598
F-statistic: 47.74 on 9 and 124 DF, p-value: < 2.2e-16
predicted_price <- predict.lm(fit,testing_data_set)
real_price <- testing_data_set$Price
Prediction <- data_frame(predicted_price,real_price) %>% mutate(difference = abs(predicted_price-real_price))
The model built here, keeping only the variables that explain the price, reaches an adjusted R-squared of 75.98% (for this specific training set).
p <- ggplot(Prediction, aes(x= real_price)) + geom_point(aes(y= predicted_price)) + geom_line(aes(y= real_price))
p
This graph shows which points are overestimated or underestimated on the test set. In this example, low prices tend to be overestimated and high prices underestimated.
quantile(Prediction$difference)
0% 25% 50% 75% 100%
0.03753141 0.95380762 1.79597941 3.90414311 17.66292398
In terms of performance, we estimate 50% of the test data with an absolute error below 2 (in price units) and 75% with an error below 4.
The previous analysis can help us interpret these results: it is possible to link an over- or underestimated point to its position on the individual factor map, and then understand that this point is not aligned with the drivers of the price.
QUESTION 17
Building one model per supplier is not a good idea because:
QUESTION 18
These data contained missing values. One representative of the company suggests either putting 0 in the missing cells or imputing with the median of the variables. Comment. For the categorical variables with missing values, it is decided to create a new category “missing”. Comment.
The first idea would change the whole structure of our data, because it would distort the correlations between the variables. The second idea does not fix the problem and changes nothing: this method only renames the “NA” category to a “missing” category.
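The distortion caused by naive imputation is easy to demonstrate: replacing missing values by 0 or by the median pulls the observations toward a single point and weakens the observed correlations. A base-R sketch on synthetic correlated data (hypothetical values, for illustration only):

```r
# Synthetic correlated pair (illustrative only)
set.seed(8)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.3)

# Knock out 30% of y at random, then impute with the median
y_missing <- y
y_missing[sample(200, 60)] <- NA
y_imputed <- ifelse(is.na(y_missing), median(y_missing, na.rm = TRUE), y_missing)

# The imputation biases the correlation toward zero
cor(x, y)          # high
cor(x, y_imputed)  # noticeably lower
```

Imputing with 0 produces the same kind of distortion, only worse when the variable is not centred near 0.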